Abstract

We propose a sample-efficient alternative for importance weighting for situations where one only 
has sample access to the probability distribution that generates the observations. Our new method, 
called Geometric Resampling (GR), is described and analyzed in the context of online combinatorial 
optimization under semi-bandit feedback, where a learner sequentially selects its actions from a 
combinatorial decision set so as to minimize its cumulative loss. In particular, we show that 
the well-known Follow-the-Perturbed-Leader (FPL) prediction method coupled with Geometric 
Resampling yields the first computationally efficient reduction from offline to online optimization 
in this setting. We provide a thorough theoretical analysis for the resulting algorithm, showing that 
its performance is on par with previous, inefficient solutions. Our main contribution is showing 
that, despite the relatively large variance induced by the GR procedure, our performance guarantees 
hold with high probability rather than only in expectation. As a side result, we also improve the 
best known regret bounds for FPL in online combinatorial optimization with full feedback, closing 
the perceived performance gap between FPL and exponential weights in this setting. 

Keywords: online learning, combinatorial optimization, bandit problems, semi-bandit feedback, 
follow the perturbed leader, importance weighting 


1. Introduction 

Importance weighting is a crucially important tool used in many areas of machine learning, and 
specifically online learning with partial feedback. While most work assumes that importance weights 
are readily available or can be computed with little effort during runtime, this is often not the case 
in many practical settings, even when one has cheap sample access to the distribution generating 
the observations. Among other cases, such situations may arise when observations are generated 
by complex hierarchical sampling schemes, probabilistic programs, or, more generally, black-box 
generative models. In this paper, we propose a simple and efficient sampling scheme called Geometric 
Resampling (GR) to compute reliable estimates of importance weights using only sample access. 

Our main motivation is studying a specific online learning algorithm whose practical applicability 
in partial-feedback settings had long been hindered by the problem outlined above. Specifically, 
we consider the well-known Follow-the-Perturbed-Leader (FPL) prediction method that maintains 
implicit sampling distributions that usually cannot be expressed in closed form. In this paper, we 
endow FPL with our Geometric Resampling scheme to construct the first known computationally 
efficient reduction from offline to online combinatorial optimization under an important
partial-information scheme known as semi-bandit feedback. In the rest of this section, we describe
our precise setting, present related work and outline our main results.

Parameters: set of decision vectors $\mathcal{S} \subseteq \{0,1\}^d$, number of rounds $T$;

For all $t = 1, 2, \dots, T$, repeat

1. The learner chooses a probability distribution $p_t$ over $\mathcal{S}$.

2. The learner draws action $V_t$ randomly according to $p_t$.

3. The environment chooses loss vector $\ell_t$.

4. The learner suffers loss $V_t^\top \ell_t$.

5. The learner observes some feedback based on $\ell_t$ and $V_t$.

Figure 1: The protocol of online combinatorial optimization.

1.1 Online Combinatorial Optimization 

We consider a special case of online linear optimization known as online combinatorial optimization 
(see Figure 1). In every round $t = 1, 2, \dots, T$ of this sequential decision problem, the learner chooses
an action $V_t$ from the finite action set $\mathcal{S} \subseteq \{0,1\}^d$, where $\|v\|_1 \le m$ holds for all
$v \in \mathcal{S}$. At the same time, the environment fixes a loss vector $\ell_t \in [0,1]^d$ and the learner
suffers loss $V_t^\top \ell_t$. The goal of the learner is to minimize the cumulative loss
$\sum_{t=1}^T V_t^\top \ell_t$. As usual in the literature of online optimization (Cesa-Bianchi and Lugosi, 2006),
we measure the performance of the learner in terms of the regret defined as

\[
R_T = \max_{v \in \mathcal{S}} \sum_{t=1}^T (V_t - v)^\top \ell_t
    = \sum_{t=1}^T V_t^\top \ell_t - \min_{v \in \mathcal{S}} \sum_{t=1}^T v^\top \ell_t, \tag{1}
\]


that is, the gap between the total loss of the learning algorithm and the best fixed decision in hindsight.
In the current paper, we focus on the case of non-oblivious (or adaptive) environments, where
we allow the loss vector $\ell_t$ to depend on the previous decisions $V_1, \dots, V_{t-1}$ in an arbitrary fashion.
Since it is well-known that no deterministic algorithm can achieve sublinear regret under such weak
assumptions, we will consider learning algorithms that choose their decisions in a randomized way.
For such learners, another performance measure that we will study is the expected regret defined as

\[
\widehat{R}_T = \max_{v \in \mathcal{S}} \mathbb{E}\left[\sum_{t=1}^T (V_t - v)^\top \ell_t\right]
             = \mathbb{E}\left[\sum_{t=1}^T V_t^\top \ell_t\right]
               - \min_{v \in \mathcal{S}} \mathbb{E}\left[\sum_{t=1}^T v^\top \ell_t\right].
\]

The framework described above is general enough to accommodate a number of interesting 
problem instances such as path planning, ranking and matching problems, finding minimum-weight 
spanning trees and cut sets. Accordingly, different versions of this general learning problem have 
drawn considerable attention in the past few years. These versions differ in the amount of information 
made available to the learner after each round $t$. In the simplest setting, called the full-information
setting, it is assumed that the learner gets to observe the loss vector $\ell_t$ regardless of the choice of
$V_t$. As this assumption does not hold for many practical applications, it is more interesting to study
the problem under partial-information constraints, meaning that the learner only gets some limited
feedback based on its own decision. In the current paper, we focus on a more realistic
partial-information scheme known as semi-bandit feedback (Audibert, Bubeck, and Lugosi, 2014) where the
learner only observes the components $\ell_{t,i}$ of the loss vector for which $V_{t,i} = 1$, that is, the losses
associated with the components selected by the learner.¹
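The interaction protocol with semi-bandit feedback can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the loss sequence is fixed in advance (an oblivious environment), and the names `play_semi_bandit`, `choose` and `regret` are hypothetical helpers introduced here, not part of the paper.

```python
import random

def play_semi_bandit(actions, losses, choose, seed=0):
    """Run the protocol of Figure 1 with semi-bandit feedback.

    actions: list of 0/1 tuples (a toy decision set S, assumed input)
    losses:  list of loss vectors, one per round (oblivious, for simplicity)
    choose:  callable(rng, history) -> index into actions (hypothetical learner)
    """
    rng = random.Random(seed)
    history, total = [], 0.0
    for loss in losses:
        v = actions[choose(rng, history)]
        total += sum(vi * li for vi, li in zip(v, loss))   # suffered loss V_t^T l_t
        # Semi-bandit feedback: only losses of the selected components are revealed.
        feedback = {i: loss[i] for i, vi in enumerate(v) if vi == 1}
        history.append((v, feedback))
    return total, history

def regret(actions, losses, total):
    """Regret (1): learner's total loss minus that of the best fixed action."""
    def cumulative(v):
        return sum(sum(vi * li for vi, li in zip(v, loss)) for loss in losses)
    return total - min(cumulative(v) for v in actions)
```

A learner that always plays the first action, for instance, is obtained with `choose=lambda rng, history: 0`; its feedback in each round is a dictionary holding only the losses of the components it selected.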

1.2 Related Work 

The most well-known instance of our problem is the multi-armed bandit problem considered in the 
seminal paper of Auer, Cesa-Bianchi, Freund, and Schapire (2002): in each round of this problem, 
the learner has to select one of $N$ arms and minimize regret against the best fixed arm while only
observing the losses of the chosen arms. In our framework, this setting corresponds to setting $d = N$
and $m = 1$. Among other contributions concerning this problem, Auer et al. propose an algorithm
called Exp3 (Exploration and Exploitation using Exponential weights) based on constructing loss
estimates $\widehat{\ell}_{t,i}$ for each component of the loss vector and playing arm $i$ with probability
proportional to $\exp\big(-\eta \sum_{s=1}^{t-1} \widehat{\ell}_{s,i}\big)$ at time $t$, where $\eta > 0$ is a parameter of the
algorithm, usually called the learning rate.² This algorithm is essentially a variant of the Exponentially
Weighted Average (EWA) forecaster (a variant of the weighted majority algorithm of Littlestone and
Warmuth, 1994, and the aggregating strategies of Vovk, 1990, also known as Hedge by Freund and
Schapire, 1997). Besides proving that the expected regret of Exp3 is $O(\sqrt{NT \log N})$, Auer et al. also
provide a general lower bound of $\Omega(\sqrt{NT})$ on the regret of any learning algorithm on this particular
problem. This lower bound was later matched by a variant of the Implicitly Normalized Forecaster (INF)
of Audibert and Bubeck (2010) by using the same loss estimates in a more refined way. Audibert and
Bubeck also show bounds of $O\big(\sqrt{NT/\log N}\,\log(N/\delta)\big)$ on the regret that hold with probability at
least $1 - \delta$, uniformly for any $\delta > 0$.

The most popular example of online learning problems with actual combinatorial structure is
the shortest path problem first considered by Takimoto and Warmuth (2003) in the full information
scheme. The same problem was considered by György, Linder, Lugosi, and Ottucsák (2007),
who proposed an algorithm that works with semi-bandit information. Since then, we have come
a long way in understanding the “price of information” in online combinatorial optimization; see
Audibert, Bubeck, and Lugosi (2014) for a complete overview of results concerning all of the
information schemes considered in the current paper. The first algorithm directly targeting general online
combinatorial optimization problems is due to Koolen, Warmuth, and Kivinen (2010): their method
named Component Hedge guarantees an optimal regret of $O(m\sqrt{T \log(d/m)})$ in the full information
setting. As later shown by Audibert, Bubeck, and Lugosi (2014), this algorithm is an instance of
a more general algorithm class known as Online Stochastic Mirror Descent (OSMD). Taking the idea
one step further, Audibert, Bubeck, and Lugosi (2014) also show that OSMD-based methods can
be used for proving expected regret bounds of $O(\sqrt{mdT})$ for the semi-bandit setting, which is also
shown to coincide with the minimax regret in this setting. For completeness, we note that the EWA
forecaster is known to attain an expected regret of $O(m^{3/2}\sqrt{T \log(d/m)})$ in the full information
case and $O(m\sqrt{dT \log(d/m)})$ in the semi-bandit case.

While the results outlined above might suggest that there is absolutely no work left to be done in 
the full information and semi-bandit schemes, we get a different picture if we restrict our attention 
to computationally efficient algorithms. First, note that methods based on exponential weighting of 
each decision vector can only be efficiently implemented for a handful of decision sets $\mathcal{S}$ (see Koolen
et al. (2010) and Cesa-Bianchi and Lugosi (2012) for some examples). Furthermore, as noted by
Audibert et al. (2014), OSMD-type methods can be efficiently implemented by convex programming if
the convex hull of the decision set can be described by a polynomial number of constraints. Details of
such an efficient implementation are worked out by Suehiro, Hatano, Kijima, Takimoto, and Nagano
(2012), whose algorithm runs in $O(d^6)$ time, which can still be prohibitive in practical applications.

¹ Here, $V_{t,i}$ and $\ell_{t,i}$ are the components of the vectors $V_t$ and $\ell_t$, respectively.

² In fact, Auer et al. mix the resulting distribution with a uniform distribution over the arms with probability $\eta N$.
However, this modification is not needed when one is concerned with the total expected regret, see, e.g., Bubeck
and Cesa-Bianchi (2012, Section 3.1).
While Koolen et al. (2010) list some further examples where OSMD can be implemented efficiently,
we conclude that there is no general efficient algorithm with near-optimal performance guarantees 
for learning in combinatorial semi-bandits. 

The Follow-the-Perturbed-Leader (FPL) prediction method (first proposed by Hannan, 1957 and
later rediscovered by Kalai and Vempala, 2005) offers a computationally efficient solution for the
online combinatorial optimization problem given that the static combinatorial optimization problem
$\min_{v \in \mathcal{S}} v^\top \ell$ admits computationally efficient solutions for any $\ell \in \mathbb{R}^d$. The idea underlying FPL is
very simple: in every round $t$, the learner draws some random perturbations $Z_t \in \mathbb{R}^d$ and selects
the action that minimizes the perturbed total losses:

\[
V_t = \arg\min_{v \in \mathcal{S}} v^\top \left( \sum_{s=1}^{t-1} \ell_s - Z_t \right).
\]
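As a minimal sketch of this rule, the snippet below implements one FPL step on top of a linear optimization oracle. Two choices here are assumptions made for illustration only: the decision set is taken to be the set of all binary vectors with exactly $m$ ones (so that the oracle reduces to picking the $m$ smallest coordinates), and the perturbations are drawn from a standard exponential distribution. The names `top_m_oracle` and `fpl_action` are hypothetical.

```python
import numpy as np

def top_m_oracle(m):
    """Oracle for the illustrative decision set S = {v in {0,1}^d : ||v||_1 = m}.

    Returns a function computing argmin_{v in S} v^T c for a cost vector c.
    """
    def oracle(c):
        v = np.zeros(len(c), dtype=int)
        v[np.argsort(c)[:m]] = 1      # the m smallest costs minimize v^T c
        return v
    return oracle

def fpl_action(cum_loss, oracle, rng):
    """One FPL step: perturb the cumulative losses, then call the oracle."""
    z = rng.exponential(1.0, size=len(cum_loss))   # assumed perturbation law
    return oracle(cum_loss - z)

rng = np.random.default_rng(0)
action = fpl_action(np.array([100.0, 0.0, 0.0, 100.0]), top_m_oracle(2), rng)
```

The only problem-specific component is the oracle: replacing `top_m_oracle` by a shortest-path or spanning-tree solver would give FPL for the corresponding decision set without changing `fpl_action`.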

Despite its conceptual simplicity and computational efficiency, FPL has been relatively overlooked
until very recently, due to two main reasons:

• The best known bound for FPL in the full information setting is $O(m\sqrt{dT})$, which is worse
than the bounds for both EWA and OSMD that scale only logarithmically with $d$.

• Considering bandit information, no efficient FPL-style algorithm is known to achieve a regret of
$O(\sqrt{T})$. On one hand, it is relatively straightforward to prove $O(T^{2/3})$ bounds on the expected
regret for an efficient FPL-variant (see, e.g., Awerbuch and Kleinberg, 2004 and McMahan and
Blum, 2004). On the other hand, Poland (2005) proved bounds of $O(\sqrt{NT \log N})$ in the $N$-armed
bandit setting; however, the proposed algorithm requires $O(T^2)$ numerical operations per round.

The main obstacle for constructing a computationally efficient FPL-variant that works with partial 
information is precisely the lack of closed-form expressions for importance weights. In the current 
paper, we address the above two issues and show that an efficient FPL-based algorithm using
independent exponentially distributed perturbations can achieve as good performance guarantees as EWA
in online combinatorial optimization.

Our work contributes to a new wave of positive results concerning FPL. Besides the reservations 
towards FPL mentioned above, the reputation of FPL has also suffered from the fact that the
nature of regularization arising from perturbations is not as well-understood as the explicit
regularization schemes underlying OSMD or EWA. Very recently, Abernethy et al. (2014) have shown that
FPL implements a form of strongly convex regularization over the convex hull of the decision space.
Furthermore, Rakhlin et al. (2012) showed that FPL run with a specific perturbation scheme can be
regarded as a relaxation of the minimax algorithm. Another recently initiated line of work shows 
that intuitive parameter-free variants of FPL can achieve excellent performance in full-information 
settings (Devroye et al., 2013 and Van Erven et al., 2014). 

1.3 Our Results 

In this paper, we propose a loss-estimation scheme called Geometric Resampling to efficiently
compute importance weights for the observed components of the loss vector. Building on this technique
and the FPL principle, we obtain an efficient algorithm for regret minimization under semi-bandit
feedback. Besides this contribution, our techniques also enable us to improve the best known regret
bounds of FPL in the full information case. We prove the following results concerning variants of
our algorithm:

• a bound of $O(m\sqrt{dT \log(d/m)})$ on the expected regret under semi-bandit feedback
(Theorem 1),

• a bound of $O\big(m\sqrt{dT \log(d/m)} + \sqrt{mdT}\,\log(1/\delta)\big)$ on the regret that holds with probability at
least $1 - \delta$, uniformly for all $\delta \in (0,1)$ under semi-bandit feedback (Theorem 2),

• a bound of $O(m^{3/2}\sqrt{T \log(d/m)})$ on the expected regret under full information (Theorem 13).

We also show that both of our semi-bandit algorithms access the optimization oracle $O(dT)$ times
over $T$ rounds with high probability, increasing the running time only by a factor of $d$ compared to
the full-information variant. Notably, our results close the gaps between the performance bounds of
FPL and EWA under both full information and semi-bandit feedback. Table 1 puts our newly proven
regret bounds into context.



                              FPL                                    EWA                           OSMD

Full info regret bound        $\mathbf{m^{3/2}\sqrt{T\log(d/m)}}$    $m^{3/2}\sqrt{T\log(d/m)}$    $m\sqrt{T\log(d/m)}$

Semi-bandit regret bound      $\mathbf{m\sqrt{dT\log(d/m)}}$         $m\sqrt{dT\log(d/m)}$         $\sqrt{mdT}$

Computationally efficient?    always                                 sometimes                     sometimes


Table 1: Upper bounds on the regret of various algorithms for online combinatorial optimization,
up to constant factors. The third row roughly describes the computational efficiency of
each algorithm; see the text for details. New results are presented in boldface.


2. Geometric Resampling 

In this section, we introduce the main idea underlying Geometric Resampling in the specific context
of $N$-armed bandits where $d = N$, $m = 1$ and the learner has access to the basis vectors $\{e_i\}_{i=1}^d$ as
its decision set $\mathcal{S}$. In this setting, components of the decision vector are referred to as arms. For
ease of notation, define $I_t$ as the unique arm such that $V_{t,I_t} = 1$ and $\mathcal{F}_{t-1}$ as the sigma-algebra
induced by the learner’s actions and observations up to the end of round $t - 1$. Using this notation,
we define $p_{t,i} = \mathbb{P}[I_t = i \mid \mathcal{F}_{t-1}]$.

Most bandit algorithms rely on feeding some loss estimates to a sequential prediction algorithm. 
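It is commonplace to consider importance-weighted loss estimates of the form

\[
\widehat{\ell}^*_{t,i} = \frac{\mathbb{I}\{I_t = i\}}{p_{t,i}}\, \ell_{t,i} \tag{2}
\]

for all $t, i$ such that $p_{t,i} > 0$. It is straightforward to show that $\widehat{\ell}^*_{t,i}$ is an unbiased estimate of the loss
$\ell_{t,i}$ for all such $t, i$. Otherwise, when $p_{t,i} = 0$, we set $\widehat{\ell}^*_{t,i} = 0$, which gives
$\mathbb{E}\big[\widehat{\ell}^*_{t,i} \mid \mathcal{F}_{t-1}\big] = 0 \le \ell_{t,i}$.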

To our knowledge, all existing bandit algorithms operating in the non-stochastic setting utilize
some version of the importance-weighted loss estimates described above. This is a very natural choice
for algorithms that operate by first computing the probabilities $p_{t,i}$ and then sampling $I_t$ from the
resulting distributions. While many algorithms fall into this class (including the Exp3 algorithm
of Auer et al. (2002), the Green algorithm of Allenberg et al. (2006) and the INF algorithm of
Audibert and Bubeck (2010)), one can think of many other algorithms where the distribution $p_t$ is
specified implicitly and thus importance weights are not readily available. Arguably, FPL is the most
important online prediction algorithm that operates with implicit distributions that are notoriously
difficult to compute in closed form. To overcome this difficulty, we propose a different loss estimate
that can be efficiently computed even when $p_t$ is not available for the learner.

Our estimation procedure dubbed Geometric Resampling (GR) is based on the simple observation
that, even though $p_{t,I_t}$ might not be computable in closed form, one can simply generate a geometric
random variable with expectation $1/p_{t,I_t}$ by repeated sampling from $p_t$. Specifically, we propose the
following procedure to be executed in round $t$:
Geometric Resampling for multi-armed bandits

1. The learner draws $I_t \sim p_t$.

2. For $k = 1, 2, \dots$

(a) Draw $I'_t(k) \sim p_t$.

(b) If $I'_t(k) = I_t$, break.

3. Let $K_t = k$.


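Observe that $K_t$ generated this way is a geometrically distributed random variable given $I_t$ and $\mathcal{F}_{t-1}$.
Consequently, we have $\mathbb{E}[K_t \mid \mathcal{F}_{t-1}, I_t] = 1/p_{t,I_t}$. We use this property to construct the estimates

\[
\widehat{\ell}_{t,i} = K_t\, \mathbb{I}\{I_t = i\}\, \ell_{t,i} \tag{3}
\]

for all arms $i$. We can easily show that the above estimate is unbiased whenever $p_{t,i} > 0$:

\begin{align*}
\mathbb{E}\big[\widehat{\ell}_{t,i} \mid \mathcal{F}_{t-1}\big]
  &= \sum_{j} p_{t,j}\, \mathbb{E}\big[\widehat{\ell}_{t,i} \mid \mathcal{F}_{t-1}, I_t = j\big] \\
  &= p_{t,i}\, \mathbb{E}\big[\ell_{t,i} K_t \mid \mathcal{F}_{t-1}, I_t = i\big] \\
  &= p_{t,i}\, \ell_{t,i}\, \mathbb{E}\big[K_t \mid \mathcal{F}_{t-1}, I_t = i\big] \\
  &= \ell_{t,i}.
\end{align*}

Notice that the above procedure produces $\widehat{\ell}_{t,i} = 0$ almost surely whenever $p_{t,i} = 0$, giving
$\mathbb{E}\big[\widehat{\ell}_{t,i} \mid \mathcal{F}_{t-1}\big] = 0$ for such $t, i$.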

One practical concern with the above sampling procedure is that its worst-case running time is
unbounded: while the expected number of necessary samples $K_t$ is at most $N$ (since
$\mathbb{E}[K_t \mid \mathcal{F}_{t-1}] = \sum_{i : p_{t,i} > 0} p_{t,i} \cdot (1/p_{t,i}) \le N$), the actual number of
samples might be much larger. In the next section, we offer a remedy to this problem, as well as
generalize the approach to work in the combinatorial semi-bandit case.
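The procedure of this section can be sketched as follows. The code is an illustration only: `sample_arm` stands for any black-box sampler of $p_t$ (the function names and the optional `max_draws` cap are assumptions introduced here), and the estimate is formed without ever evaluating $p_{t,i}$.

```python
import numpy as np

def geometric_resampling(sample_arm, arm, max_draws=None):
    """Count fresh draws from p_t until `arm` reappears (optionally capped)."""
    k = 0
    while True:
        k += 1
        if sample_arm() == arm or (max_draws is not None and k >= max_draws):
            return k

def gr_loss_estimate(sample_arm, loss, n_arms):
    """Estimate (3): K_t * 1{I_t = i} * loss_i, using sample access only."""
    arm = sample_arm()                  # the arm actually played, I_t
    k = geometric_resampling(sample_arm, arm)
    estimate = np.zeros(n_arms)
    estimate[arm] = k * loss[arm]       # only the played arm's loss is observed
    return estimate

# Empirical check of unbiasedness on a toy distribution (illustrative values).
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
loss = np.array([0.2, 0.9, 0.5])
sampler = lambda: int(rng.choice(3, p=p))
average = np.mean([gr_loss_estimate(sampler, loss, 3) for _ in range(5000)], axis=0)
```

With enough repetitions, `average` should be close to `loss`, in line with the unbiasedness argument above; the price is a higher variance than that of the estimate (2).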


3. An Efficient Algorithm for Combinatorial Semi-Bandits 

In this section, we present our main result: an efficient reduction from offline to online combinatorial
optimization under semi-bandit feedback. The most critical element in our technique is extending the
Geometric Resampling idea to the case of combinatorial action sets. For defining the procedure, let
us assume that we are running a randomized algorithm mapping histories to probability distributions
over the action set $\mathcal{S}$: letting $\mathcal{F}_{t-1}$ denote the sigma-algebra induced by the history of interaction
between the learner and the environment, the algorithm picks action $v \in \mathcal{S}$ with probability
$p_t(v) = \mathbb{P}[V_t = v \mid \mathcal{F}_{t-1}]$. Also introducing $q_{t,i} = \mathbb{E}[V_{t,i} \mid \mathcal{F}_{t-1}]$, we can define the counterpart
of the standard importance-weighted loss estimates of Equation 2 as the vector $\widehat{\ell}^*_t$ with components

\[
\widehat{\ell}^*_{t,i} = \frac{V_{t,i}}{q_{t,i}}\, \ell_{t,i}. \tag{4}
\]

Again, the problem with these estimates is that for many algorithms of practical interest, the
importance weights $q_{t,i}$ cannot be computed in closed form. We now extend the Geometric Resampling
procedure defined in the previous section to estimate the importance weights in an efficient
manner. One adjustment we make to the procedure presented in the previous section is capping off
the number of samples at some finite $M > 0$. While this capping obviously introduces some bias,
we will show later that for appropriate values of $M$, this bias does not hurt the performance of
the overall learning algorithm too much. Thus, we define the Geometric Resampling procedure for
combinatorial semi-bandits as follows:

Geometric Resampling for combinatorial semi-bandits

1. The learner draws $V_t \sim p_t$.

2. For $k = 1, 2, \dots, M$, draw $V'_t(k) \sim p_t$.

3. For $i = 1, 2, \dots, d$,

$K_{t,i} = \min\big(\{k : V'_{t,i}(k) = 1\} \cup \{M\}\big)$.


Based on the random variables output by the GR procedure, we construct our loss-estimate vector
$\widehat{\ell}_t \in \mathbb{R}^d$ with components

\[
\widehat{\ell}_{t,i} = K_{t,i}\, V_{t,i}\, \ell_{t,i} \tag{5}
\]

for all $i = 1, 2, \dots, d$. Since the $V_{t,i}$ are nonzero only for coordinates for which $\ell_{t,i}$ is observed, these
estimates are well-defined. It also follows that the sampling procedure can be terminated once for
every $i$ with $V_{t,i} = 1$, there is a copy $V'_t(k)$ such that $V'_{t,i}(k) = 1$.

Now everything is ready to define our algorithm: FPL+GR, standing for Follow-the-Perturbed-Leader
with Geometric Resampling. Defining $\widehat{L}_t = \sum_{s=1}^{t} \widehat{\ell}_s$, at time step $t$ FPL+GR draws the
components of the perturbation vector $Z_t$ independently from a standard exponential distribution
and selects action³

\[
V_t = \arg\min_{v \in \mathcal{S}} v^\top \big(\eta \widehat{L}_{t-1} - Z_t\big), \tag{6}
\]

where $\eta > 0$ is a parameter of the algorithm. As we mentioned earlier, the distribution $p_t$, while
implicitly specified by $Z_t$ and the estimated cumulative losses $\widehat{L}_{t-1}$, cannot usually be expressed in
closed form for FPL.⁴ However, sampling the actions $V'_t(\cdot)$ can be carried out by drawing additional
perturbation vectors $Z'_t(\cdot)$ independently from the same distribution as $Z_t$ and then solving a linear
optimization task. We emphasize that the above additional actions are never actually played by the
algorithm, but are only necessary for constructing the loss estimates. The power of FPL+GR is that,
unlike other algorithms for combinatorial semi-bandits, its implementation only requires access to a
linear optimization oracle over $\mathcal{S}$. We point the reader to Section 3.2 for a more detailed discussion
of the running time of FPL+GR. Pseudocode for FPL+GR is shown as Algorithm 1.

³ By the definition of the perturbation distribution, the minimum is unique almost surely.

⁴ One notable exception is when the perturbations are drawn independently from standard Gumbel distributions,
and the decision set is the d-dimensional simplex: in this case, FPL is known to be equivalent to EWA; see, e.g.,
Abernethy et al. (2014) for further discussion.

Algorithm 1: FPL+GR implemented with a waiting list. The notation $a \circ b$ stands for the elementwise
product of vectors $a$ and $b$: $(a \circ b)_i = a_i b_i$ for all $i$; $\mathbf{1}$ denotes the all-ones vector.

Input: $\mathcal{S} \subseteq \{0,1\}^d$, $\eta \in \mathbb{R}_+$, $M \in \mathbb{Z}_+$;
Initialization: $\widehat{L} = 0 \in \mathbb{R}^d$;
for $t = 1, \dots, T$ do
    Draw $Z \in \mathbb{R}^d$ with independent components $Z_i \sim \mathrm{Exp}(1)$;
    Choose action $V = \arg\min_{v \in \mathcal{S}} v^\top (\eta \widehat{L} - Z)$;       /* Follow the perturbed leader */
    $K = 0$; $r = V$;                                                  /* Initialize waiting list and counters */
    for $k = 1, \dots, M$ do                                           /* Geometric Resampling */
        $K = K + r$;                                                   /* Increment counter */
        Draw $Z' \in \mathbb{R}^d$ with independent components $Z'_i \sim \mathrm{Exp}(1)$;
        $V' = \arg\min_{v \in \mathcal{S}} v^\top (\eta \widehat{L} - Z')$;                /* Sample a copy of V */
        $r = r \circ (\mathbf{1} - V')$;                               /* Update waiting list */
        if $r = 0$ then break;                                         /* All indices recurred */
    end
    $\widehat{L} = \widehat{L} + K \circ V \circ \ell$;                /* Update cumulative loss estimates */
end

As we will show shortly, FPL+GR as defined above comes with strong performance guarantees that
hold in expectation. One can think of several possible ways to robustify FPL+GR so that it provides
bounds that hold with high probability. One possible path is to follow Auer et al. (2002) and define
the loss-estimate vector $\widetilde{\ell}^*_t$ with components

\[
\widetilde{\ell}^*_{t,i} = \widehat{\ell}^*_{t,i} - \frac{\beta}{q_{t,i}}
\]

for some $\beta > 0$. The obvious problem with this definition is that it requires perfect knowledge of
the importance weights $q_{t,i}$ for all $i$. While it is possible to extend Geometric Resampling developed
in the previous sections to construct a reliable proxy to the above loss estimate, there are several
downsides to this approach. First, observe that one would need to obtain estimates of $1/q_{t,i}$ for every
single $i$, even for the ones for which $V_{t,i} = 0$. Due to this necessity, there is no hope to terminate
the sampling procedure in reasonable time. Second, reliable estimation requires multiple samples of
$K_{t,i}$, where the sample size has to explicitly depend on the desired confidence level.

Thus, we follow a different path: Motivated by the work of Audibert and Bubeck (2010), we
propose to use a loss-estimate vector $\widetilde{\ell}_t$ with components of the form

\[
\widetilde{\ell}_{t,i} = \frac{1}{\beta} \log\big(1 + \beta \widehat{\ell}_{t,i}\big) \tag{7}
\]

with an appropriately chosen $\beta > 0$. Then, defining $\widetilde{L}_{t-1} = \sum_{s=1}^{t-1} \widetilde{\ell}_s$, we propose a variant of FPL+GR
that simply replaces $\widehat{L}_{t-1}$ by $\widetilde{L}_{t-1}$ in the rule (6) for choosing $V_t$. We refer to this variant of FPL+GR
as FPL+GR.P. In the next section, we provide performance guarantees for both algorithms.
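To make the waiting-list implementation concrete, the following Python sketch mirrors Algorithm 1 and the FPL+GR.P variant. Several points are assumptions made for illustration: the loss sequence is a fixed array (an oblivious environment), the decision set is the set of binary vectors with exactly $m$ ones so that `top_m_oracle` can serve as the optimization oracle, and all function names are hypothetical. The waiting list is updated by removing the indices that recurred in the sampled copy, which is what the definition of $K_{t,i}$ requires.

```python
import numpy as np

def top_m_oracle(m):
    """Hypothetical oracle for S = {v in {0,1}^d : ||v||_1 = m}."""
    def oracle(c):
        v = np.zeros(len(c), dtype=int)
        v[np.argsort(c)[:m]] = 1
        return v
    return oracle

def fpl_gr_round(L_hat, eta, M, oracle, rng):
    """One round of FPL+GR: returns the played action V and the counters K."""
    d = len(L_hat)
    V = oracle(eta * L_hat - rng.exponential(1.0, size=d))   # rule (6)
    K = np.zeros(d, dtype=int)
    r = V.copy()                       # waiting list: played indices not yet recurred
    for _ in range(M):                 # at most M resampled copies
        K += r                         # every waiting index needs one more sample
        V_copy = oracle(eta * L_hat - rng.exponential(1.0, size=d))
        r = r * (1 - V_copy)           # drop indices that recurred in this copy
        if not r.any():
            break
    return V, K

def fpl_gr(losses, eta, M, oracle, beta=None, seed=0):
    """Run FPL+GR (or FPL+GR.P when beta is given) on a T x d array of losses."""
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(losses.shape[1])
    total = 0.0
    for loss in losses:
        V, K = fpl_gr_round(L_hat, eta, M, oracle, rng)
        total += float(V @ loss)
        est = K * V * loss             # Equation (5): only observed components enter
        if beta is not None:
            est = np.log1p(beta * est) / beta     # Equation (7)
        L_hat += est
    return total, L_hat
```

Note that the resampled copies are used only to build the counters `K`; the loss charged to the learner in each round is that of the single played action `V`.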


3.1 Performance Guarantees 

Now we are ready to state our main results. Proofs will be presented in Section 4. First, we present 
a performance guarantee for FPL+GR in terms of the expected regret: 


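Theorem 1 The expected regret of FPL+GR satisfies

\[
\widehat{R}_T \le \frac{m\,(\log(d/m) + 1)}{\eta} + 2\eta m d T + \frac{dT}{eM}
\]

under semi-bandit information. In particular, with

\[
\eta = \sqrt{\frac{\log(d/m) + 1}{2dT}}
\qquad \text{and} \qquad
M = \left\lceil \frac{\sqrt{dT}}{e m \sqrt{2\,(\log(d/m) + 1)}} \right\rceil,
\]

the expected regret of FPL+GR is bounded as

\[
\widehat{R}_T \le 3m \sqrt{2dT \left( \log\frac{d}{m} + 1 \right)}.
\]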
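The tuning in Theorem 1 can be checked numerically. The sketch below assumes that the general bound has the three-term form $m(\log(d/m)+1)/\eta + 2\eta m d T + dT/(eM)$ and that $M$ is rounded up to an integer; the function name and the example values of $d$, $m$, $T$ are illustrative only.

```python
import math

def theorem1_tuning(d, m, T):
    """Return (eta, M, general bound at this tuning, simplified bound)."""
    c = math.log(d / m) + 1
    eta = math.sqrt(c / (2 * d * T))
    M = math.ceil(math.sqrt(d * T) / (math.e * m * math.sqrt(2 * c)))
    general = m * c / eta + 2 * eta * m * d * T + d * T / (math.e * M)
    tuned = 3 * m * math.sqrt(2 * d * T * c)
    return eta, M, general, tuned

eta, M, general, tuned = theorem1_tuning(d=20, m=3, T=10_000)
```

With this choice of $\eta$ the first two terms are equal, each being $m\sqrt{2dT(\log(d/m)+1)}$, and the choice of $M$ makes the bias term $dT/(eM)$ at most the same quantity, which is how the factor 3 arises.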
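Our second main contribution is the following bound on the regret of FPL+GR.P.

Theorem 2 Fix an arbitrary $\delta > 0$. With probability at least $1 - \delta$, the regret of FPL+GR.P satisfies

\begin{align*}
R_T \le{}& \frac{m\,(\log(d/m) + 1)}{\eta}
      + \eta \left( M m \sqrt{2T \log\frac{5}{\delta}} + 2md \sqrt{T \log\frac{5}{\delta}} + 2mdT \right)
      + \frac{dT}{eM} \\
    &+ \beta \left( M \sqrt{2mT \log\frac{5}{\delta}} + 2d \sqrt{T \log\frac{5}{\delta}} + 2dT \right)
      + \frac{m \log(5d/\delta)}{\beta} \\
    &+ m \sqrt{2(e-2)T}\, \log\frac{5}{\delta} + \sqrt{8T \log\frac{5}{\delta}} + \sqrt{2(e-2)T}.
\end{align*}

In particular, with

\[
M = \left\lceil \sqrt{\frac{dT}{m}} \right\rceil, \qquad
\beta = \sqrt{\frac{m}{dT}} \qquad \text{and} \qquad
\eta = \sqrt{\frac{\log(d/m) + 1}{dT}},
\]

the regret of FPL+GR.P is bounded as

\begin{align*}
R_T \le{}& 3m \sqrt{dT \left( \log\frac{d}{m} + 1 \right)}
      + \sqrt{mdT} \left( \log\frac{5d}{\delta} + 2 \right)
      + \sqrt{2mT \log\frac{5}{\delta}} \left( \sqrt{\log\frac{d}{m} + 1} + 1 \right) \\
    &+ 1.2\, m \sqrt{T} \log\frac{5}{\delta}
      + \sqrt{T} \left( \sqrt{8 \log\frac{5}{\delta}} + 1.2 \right)
      + 2 \sqrt{d \log\frac{5}{\delta}} \left( m \sqrt{\log\frac{d}{m} + 1} + \sqrt{m} \right)
\end{align*}

with probability at least $1 - \delta$.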


3.2 Running Time 

Let us now turn our attention to computational issues. First, we note that the efficiency of FPL-type
algorithms crucially depends on the availability of an efficient oracle that solves the static
combinatorial optimization problem of finding $\arg\min_{v \in \mathcal{S}} v^\top \ell$. Computing the running time of the
full-information variant of FPL is straightforward: assuming that the oracle computes the solution
to the static problem in $O(f(\mathcal{S}))$ time, FPL returns its prediction in $O(f(\mathcal{S}) + d)$ time (with the
$d$ overhead coming from the time necessary to generate the perturbations). Naturally, our loss
estimation scheme multiplies these computations by the number of samples taken in each round.
While terminating the estimation procedure after $M$ samples helps in controlling the running time
with high probability, observe that the naive bound of $MT$ on the number of samples becomes far
too large when setting $M$ as suggested by Theorems 1 and 2. The next proposition shows that the
amortized running time of Geometric Resampling remains as low as $O(d)$ even for large values of
$M$.

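Proposition 3 Let $S_t$ denote the number of sample actions taken by GR in round $t$. Then, $\mathbb{E}[S_t] \le d$.
Also, for any $\delta > 0$,

\[
\sum_{t=1}^T S_t \le (e - 1)\, dT + M \log\frac{1}{\delta}
\]

holds with probability at least $1 - \delta$.

Proof For proving the first statement, let us fix a time step $t$ and notice that

\[
S_t = \max_{j : V_{t,j} = 1} K_{t,j} = \max_{j = 1, 2, \dots, d} V_{t,j} K_{t,j} \le \sum_{j=1}^d V_{t,j} K_{t,j}.
\]

Now, observe that $\mathbb{E}[K_{t,j} \mid \mathcal{F}_{t-1}, V_{t,j}] \le 1/\mathbb{E}[V_{t,j} \mid \mathcal{F}_{t-1}]$, which gives $\mathbb{E}[S_t] \le d$, thus proving the
first statement. For the second part, notice that $X_t = (S_t - \mathbb{E}[S_t \mid \mathcal{F}_{t-1}])$ is a martingale-difference
sequence with respect to $(\mathcal{F}_t)$ with $X_t \le M$ and with conditional variance

\begin{align*}
\mathrm{Var}[X_t \mid \mathcal{F}_{t-1}]
  &= \mathbb{E}\big[(S_t - \mathbb{E}[S_t \mid \mathcal{F}_{t-1}])^2 \,\big|\, \mathcal{F}_{t-1}\big]
   \le \mathbb{E}\big[S_t^2 \,\big|\, \mathcal{F}_{t-1}\big] \\
  &= \mathbb{E}\Big[\max_{j} (V_{t,j} K_{t,j})^2 \,\Big|\, \mathcal{F}_{t-1}\Big]
   \le \mathbb{E}\Big[\sum_{j=1}^d V_{t,j} K_{t,j}^2 \,\Big|\, \mathcal{F}_{t-1}\Big] \\
  &\le \sum_{j=1}^d \min\left\{ \frac{2}{q_{t,j}},\, M \right\} \le dM,
\end{align*}

where we used $\mathbb{E}\big[K_{t,j}^2 \mid \mathcal{F}_{t-1}\big] \le \frac{2 - q_{t,j}}{q_{t,j}^2}$. Then, the second statement follows from applying a version
of Freedman’s inequality due to Beygelzimer et al. (2011) (stated as Lemma 16 in the appendix)
with $B = M$ and $\Sigma_T \le dMT$. ■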

Notice that choosing $M = O(\sqrt{dT})$ as suggested by Theorems 1 and 2, the above result guarantees
that the amortized running time of FPL+GR is $O\big((d + \sqrt{d/T}) \cdot (f(\mathcal{S}) + d)\big)$ with high probability.
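The arithmetic behind this remark is short enough to spell out. The sketch below divides the high-probability bound of Proposition 3 by $T$ to obtain the amortized number of oracle calls per round; the function name and the numerical values are illustrative assumptions.

```python
import math

def amortized_samples(d, T, M, delta):
    """Per-round average of the bound (e - 1) d T + M log(1 / delta)."""
    return ((math.e - 1) * d * T + M * math.log(1 / delta)) / T

d, T, delta = 10, 1000, 0.05
M = math.sqrt(d * T)                    # the order of M suggested by the theorems
per_round = amortized_samples(d, T, M, delta)
```

With $M = \sqrt{dT}$ the second term becomes $\sqrt{d/T}\,\log(1/\delta)$, so the per-round count is of order $d + \sqrt{d/T}$, and each sample costs one oracle call plus $d$ operations for generating the perturbations.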


4. Analysis 

This section presents the proofs of Theorems 1 and 2. In a didactic attempt, we present statements 
concerning the loss-estimation procedure and the learning algorithm separately: Section 4.1 presents 
various important properties of the loss estimates produced by Geometric Resampling, Section 4.2 
presents general tools for analyzing Follow-the-Perturbed-Leader methods. Finally, Sections 4.3 
and 4.4 put these results together to prove Theorems 1 and 2, respectively. 


4.1 Properties of Geometric Resampling 

The basic idea underlying Geometric Resampling is replacing the importance weights $1/q_{t,i}$ by
appropriately defined random variables $K_{t,i}$. As we have seen earlier (Section 2), running GR with
$M = \infty$ amounts to sampling each $K_{t,i}$ from a geometric distribution with expectation $1/q_{t,i}$,
yielding an unbiased loss estimate. In practice, one would want to set $M$ to a finite value to ensure
that the running time of the sampling procedure is bounded. Note however that early termination
of GR introduces a bias in the loss estimates. This section is mainly concerned with the nature of
this bias. We emphasize that the statements presented in this section remain valid no matter what
randomized algorithm generates the actions $V_t$. Our first lemma gives an explicit expression for the
expectation of the loss estimates generated by GR.

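Lemma 4 For all $j$ and $t$ such that $q_{t,j} > 0$, the loss estimates (5) satisfy

\[
\mathbb{E}\big[\widehat{\ell}_{t,j} \mid \mathcal{F}_{t-1}\big] = \big(1 - (1 - q_{t,j})^M\big)\, \ell_{t,j}.
\]

Proof Fix any $j, t$ satisfying the condition of the lemma. Setting $q = q_{t,j}$ for simplicity, we write

\begin{align*}
\mathbb{E}[K_{t,j} \mid \mathcal{F}_{t-1}]
  &= \sum_{k=1}^{\infty} k (1 - q)^{k-1} q - \sum_{k=M}^{\infty} (k - M)(1 - q)^{k-1} q \\
  &= \sum_{k=1}^{\infty} k (1 - q)^{k-1} q - (1 - q)^M \sum_{k=M}^{\infty} (k - M)(1 - q)^{k-M-1} q \\
  &= \big(1 - (1 - q)^M\big) \sum_{k=1}^{\infty} k (1 - q)^{k-1} q = \frac{1 - (1 - q)^M}{q}.
\end{align*}

The proof is concluded by combining the above with
$\mathbb{E}\big[\widehat{\ell}_{t,j} \mid \mathcal{F}_{t-1}\big] = q_{t,j}\, \ell_{t,j}\, \mathbb{E}[K_{t,j} \mid \mathcal{F}_{t-1}]$. ■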
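The closed form in the proof of Lemma 4 is easy to verify numerically. The sketch below computes the mean of a geometric variable truncated at $M$ directly from its distribution and compares it with $(1 - (1-q)^M)/q$; the function names are hypothetical and the values of $q$ and $M$ are arbitrary.

```python
import math

def truncated_geometric_mean(q, M):
    """E[min(G, M)] for G geometric on {1, 2, ...} with success probability q."""
    below_cap = sum(k * (1 - q) ** (k - 1) * q for k in range(1, M))
    at_cap = M * (1 - q) ** (M - 1)     # P[G >= M] = (1 - q)^(M - 1)
    return below_cap + at_cap

def lemma4_closed_form(q, M):
    """The expression (1 - (1 - q)^M) / q appearing in the proof."""
    return (1 - (1 - q) ** M) / q

agree = all(
    math.isclose(truncated_geometric_mean(q, M), lemma4_closed_form(q, M))
    for q in (0.05, 0.3, 0.9)
    for M in (1, 5, 50)
)
```

Multiplying the closed form by $q$ gives the factor $1 - (1-q)^M$ by which the estimate (5) shrinks the true loss in expectation; it approaches 1 as $M$ grows.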




The following lemma shows two important properties of the GR loss estimates (5). Roughly speaking, 
the first of these properties ensures that any learning algorithm relying on these estimates will be
optimistic in the sense that the loss of any fixed decision will be underestimated in expectation. The 
second property ensures that the learner will not be overly optimistic concerning its own performance. 


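Lemma 5 For all $v \in \mathcal{S}$ and $t$, the loss estimates (5) satisfy the following two properties:

\[
\mathbb{E}\big[v^\top \widehat{\ell}_t \,\big|\, \mathcal{F}_{t-1}\big] \le v^\top \ell_t, \tag{8}
\]

\[
\mathbb{E}\Big[\sum_{u \in \mathcal{S}} p_t(u) \big(u^\top \widehat{\ell}_t\big) \,\Big|\, \mathcal{F}_{t-1}\Big]
  \ge \sum_{u \in \mathcal{S}} p_t(u) \big(u^\top \ell_t\big) - \frac{d}{eM}. \tag{9}
\]

Proof Fix any $v \in \mathcal{S}$ and $t$. The first property is an immediate consequence of Lemma 4: we
have that $\mathbb{E}\big[\widehat{\ell}_{t,k} \mid \mathcal{F}_{t-1}\big] \le \ell_{t,k}$ for all $k$, and thus
$\mathbb{E}\big[v^\top \widehat{\ell}_t \mid \mathcal{F}_{t-1}\big] \le v^\top \ell_t$. For the second statement, observe that

\[
\mathbb{E}\Big[\sum_{u \in \mathcal{S}} p_t(u) \big(u^\top \widehat{\ell}_t\big) \,\Big|\, \mathcal{F}_{t-1}\Big]
  = \sum_{i=1}^d q_{t,i}\, \mathbb{E}\big[\widehat{\ell}_{t,i} \mid \mathcal{F}_{t-1}\big]
  = \sum_{i=1}^d q_{t,i} \big(1 - (1 - q_{t,i})^M\big)\, \ell_{t,i}
\]

also holds by Lemma 4. To control the bias term $\sum_i q_{t,i} (1 - q_{t,i})^M$, note that
$q_{t,i} (1 - q_{t,i})^M \le q_{t,i}\, e^{-M q_{t,i}}$. By elementary calculations, one can show that $f(q) = q e^{-Mq}$ takes its
maximum at $q = \frac{1}{M}$ and thus $\sum_{i=1}^d q_{t,i} (1 - q_{t,i})^M \le \frac{d}{eM}$. ■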
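For completeness, the elementary calculation behind the bias bound can be written out as follows; it uses only the inequality $1 - x \le e^{-x}$ and the fact that $\ell_{t,i} \le 1$.

```latex
\begin{align*}
(1 - q)^M &\le e^{-Mq}
  && \text{since } 1 - x \le e^{-x} \text{ for all } x, \\
f(q) &= q\, e^{-Mq}, \qquad f'(q) = (1 - Mq)\, e^{-Mq}, \\
f'(q) &= 0 \iff q = \tfrac{1}{M},
  \qquad f\!\left(\tfrac{1}{M}\right) = \tfrac{1}{eM}, \\
\sum_{i=1}^d q_{t,i} (1 - q_{t,i})^M\, \ell_{t,i}
  &\le \sum_{i=1}^d q_{t,i}\, e^{-M q_{t,i}}
   \le \sum_{i=1}^d \frac{1}{eM} = \frac{d}{eM}.
\end{align*}
```

Since $f'$ is positive for $q < 1/M$ and negative for $q > 1/M$, the stationary point is indeed the maximum of $f$ on $[0, 1]$.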

Our last lemma concerning the loss estimates (5) bounds the conditional variance of the estimated 
loss of the learner. This term plays a key role in the performance analysis of Exp3-style algorithms 
(see, e.g., Auer et al. (2002); Uchiya et al. (2010); Audibert et al. (2014)), as well as in the analysis 
presented in the current paper. 


Lemma 6 For all $t$, the loss estimates (5) satisfy 

\[
\mathbb{E}\Bigl[\sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr)^2 \Bigm| \mathcal{F}_{t-1}\Bigr] \le 2md.
\]


Before proving the statement, we remark that the conditional variance can be bounded as md for 
the standard (although usually infeasible) loss estimates (4). That is, the above lemma shows that, 
somewhat surprisingly, the variance of our estimates is only twice as large as the variance of the 
standard estimates. 


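Proof Fix an arbitrary $t$. To simplify the notation below, let us introduce $\widetilde{V}$ as an independent 
copy of $V_t$ such that $\mathbb{P}\bigl[\widetilde{V} = v \mid \mathcal{F}_{t-1}\bigr] = p_t(v)$ holds for all $v \in \mathcal{S}$. To begin, observe that for any $i$ 

\[
\mathbb{E}\bigl[K_{t,i}^2 \mid \mathcal{F}_{t-1}\bigr] \le \frac{2 - q_{t,i}}{q_{t,i}^2} \le \frac{2}{q_{t,i}^2} \tag{10}
\]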




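holds, as $K_{t,i}$ has a truncated geometric law. The statement is proven as 

\[
\begin{aligned}
\mathbb{E}\Bigl[\sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr)^2 \Bigm| \mathcal{F}_{t-1}\Bigr]
&= \mathbb{E}\Bigl[\sum_{i=1}^{d}\sum_{j=1}^{d} \bigl(\widetilde{V}_i \widehat{\ell}_{t,i}\bigr)\bigl(\widetilde{V}_j \widehat{\ell}_{t,j}\bigr) \Bigm| \mathcal{F}_{t-1}\Bigr] \\
&= \mathbb{E}\Bigl[\sum_{i=1}^{d}\sum_{j=1}^{d} \bigl(\widetilde{V}_i K_{t,i} V_{t,i} \ell_{t,i}\bigr)\bigl(\widetilde{V}_j K_{t,j} V_{t,j} \ell_{t,j}\bigr) \Bigm| \mathcal{F}_{t-1}\Bigr]
&& \text{(using the definition of $\widehat{\ell}_t$)} \\
&\le \mathbb{E}\Bigl[\sum_{i=1}^{d}\sum_{j=1}^{d} \frac{K_{t,i}^2 + K_{t,j}^2}{2} \bigl(\widetilde{V}_i V_{t,i} \ell_{t,i}\bigr)\bigl(\widetilde{V}_j V_{t,j} \ell_{t,j}\bigr) \Bigm| \mathcal{F}_{t-1}\Bigr]
&& \text{(using $2AB \le A^2 + B^2$)} \\
&\le 2\,\mathbb{E}\Bigl[\sum_{i=1}^{d} \frac{1}{q_{t,i}^2} \bigl(\widetilde{V}_i V_{t,i} \ell_{t,i}\bigr) \sum_{j=1}^{d} V_{t,j} \ell_{t,j} \Bigm| \mathcal{F}_{t-1}\Bigr]
&& \text{(using symmetry, Eq. (10) and $\widetilde{V}_j \le 1$)} \\
&\le 2m\,\mathbb{E}\Bigl[\sum_{j=1}^{d} \frac{1}{q_{t,j}} V_{t,j} \ell_{t,j} \Bigm| \mathcal{F}_{t-1}\Bigr] \le 2md,
\end{aligned}
\]

where the last line follows from using $\|V_t\|_1 \le m$, $\|\ell_t\|_\infty \le 1$, and $\mathbb{E}\bigl[V_{t,i} \mid \mathcal{F}_{t-1}\bigr] = \mathbb{E}\bigl[\widetilde{V}_i \mid \mathcal{F}_{t-1}\bigr] = q_{t,i}$. 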
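The second-moment bound of Eq. (10), on which the proof above relies, can be verified by direct summation. This is a hedged sketch under the same assumption as before (a geometric variable with success probability $q$, truncated at $M$); the names `truncated_geometric_second_moment` and `second_moment_bound` are hypothetical.

```python
def truncated_geometric_second_moment(q: float, M: int) -> float:
    """Exact E[K^2] for K = min(G, M), where G is geometric on {1, 2, ...}."""
    # Mass (1 - q)^(k - 1) * q on each k < M, and all remaining mass on k = M.
    head = sum(k * k * (1 - q) ** (k - 1) * q for k in range(1, M))
    tail = M * M * (1 - q) ** (M - 1)
    return head + tail


def second_moment_bound(q: float) -> float:
    """(2 - q) / q^2, the second moment of the untruncated geometric law."""
    return (2 - q) / q ** 2


if __name__ == "__main__":
    for q in (0.05, 0.3, 0.9):
        for M in (1, 10, 200):
            print(q, M, truncated_geometric_second_moment(q, M), second_moment_bound(q))
```

Because truncation can only decrease $K$, the truncated second moment never exceeds the untruncated one, and it approaches $(2-q)/q^2$ as $M$ grows.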


4.2 General Tools for Analyzing FPL 

In this section, we present the key tools for analyzing the FPL-component of our learning algorithm. 
In some respects, our analysis is a synthesis of previous work on FPL-style methods: we borrow 
several ideas from Poland (2005) and the proof of Corollary 4.5 in Cesa-Bianchi and Lugosi (2006). 
Nevertheless, our analysis is the first to directly target combinatorial settings, and yields the tightest 
known bounds for FPL in this domain. Indeed, the tools developed in this section also permit an 
improvement for FPL in the full-information setting, closing the presumed performance gap between 
FPL and EWA in both the full-information and the semi-bandit settings. The statements we present 
in this section are not specific to the loss-estimate vectors used by FPL+GR. 

Like most other known work, we study the performance of the learning algorithm through a 
virtual algorithm that (i) uses a time-independent perturbation vector and (ii) is allowed to peek 
one step into the future. Specifically, letting $\widetilde{Z}$ be a perturbation vector drawn independently from 
the same distribution as $Z_1$, the virtual algorithm picks its $t$-th action as 

\[
\widetilde{V}_t = \operatorname*{arg\,min}_{v \in \mathcal{S}} \Bigl\{ v^\top \bigl(\eta \widehat{L}_t - \widetilde{Z}\bigr) \Bigr\}. \tag{11}
\]

In what follows, we will crucially use that $\widetilde{V}_t$ and $V_{t+1}$ are conditionally independent and identically 
distributed given $\mathcal{F}_t$. In particular, introducing the notations 

\[
\begin{aligned}
q_{t,i} &= \mathbb{E}\bigl[V_{t,i} \mid \mathcal{F}_{t-1}\bigr], & \widetilde{q}_{t,i} &= \mathbb{E}\bigl[\widetilde{V}_{t,i} \mid \mathcal{F}_t\bigr], \\
p_t(v) &= \mathbb{P}\bigl[V_t = v \mid \mathcal{F}_{t-1}\bigr], & \widetilde{p}_t(v) &= \mathbb{P}\bigl[\widetilde{V}_t = v \mid \mathcal{F}_t\bigr],
\end{aligned}
\]

we will exploit the above property by using $q_{t,i} = \widetilde{q}_{t-1,i}$ and $p_t(v) = \widetilde{p}_{t-1}(v)$ numerous times in the 
proofs below. 

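First, we show a regret bound on the virtual algorithm that plays the action sequence $\widetilde{V}_1, \widetilde{V}_2, \dots, \widetilde{V}_T$. 

Lemma 7 For any $v \in \mathcal{S}$, 

\[
\sum_{t=1}^{T} \sum_{u \in \mathcal{S}} \widetilde{p}_t(u)\bigl((u - v)^\top \widehat{\ell}_t\bigr) \le \frac{m(\log(d/m) + 1)}{\eta}. \tag{12}
\]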

Although the proof of this statement is rather standard, we include it for completeness. We also note 
that the lemma slightly improves other known results by replacing the usual log d term by log(d/m). 
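Proof Fix any $v \in \mathcal{S}$. Using Lemma 3.1 of Cesa-Bianchi and Lugosi (2006) (sometimes referred to 
as the “follow-the-leader/be-the-leader” lemma) for the sequence $(\eta\widehat{\ell}_1 - \widetilde{Z}, \eta\widehat{\ell}_2, \dots, \eta\widehat{\ell}_T)$, we obtain 

\[
\eta \sum_{t=1}^{T} \widetilde{V}_t^\top \widehat{\ell}_t - \widetilde{V}_1^\top \widetilde{Z} \le \eta \sum_{t=1}^{T} v^\top \widehat{\ell}_t - v^\top \widetilde{Z}.
\]

Reordering and integrating both sides with respect to the distribution of $\widetilde{Z}$ gives 

\[
\eta \sum_{t=1}^{T} \sum_{u \in \mathcal{S}} \widetilde{p}_t(u)\bigl((u - v)^\top \widehat{\ell}_t\bigr) \le \mathbb{E}\bigl[(\widetilde{V}_1 - v)^\top \widetilde{Z}\bigr]. \tag{13}
\]

The statement follows from using $\mathbb{E}\bigl[\widetilde{V}_1^\top \widetilde{Z}\bigr] \le m(\log(d/m) + 1)$, which is proven in Appendix A as 
Lemma 14, noting that $\widetilde{V}_1^\top \widetilde{Z}$ is upper-bounded by the sum of the $m$ largest components of $\widetilde{Z}$. ■ 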


The next lemma relates the performance of the virtual algorithm to the actual performance of FPL. 
The lemma relies on a “sparse-loss” trick similar to the trick used in the proof of Corollary 4.5 in 
Cesa-Bianchi and Lugosi (2006), and is also related to the “unit rule” discussed by Koolen et al. 
(2010). 


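Lemma 8 For all $t = 1, 2, \dots, T$, assume that $\widehat{\ell}_t$ is such that $\widehat{\ell}_{t,k} \ge 0$ for all $k \in \{1, 2, \dots, d\}$. 
Then, 

\[
\sum_{u \in \mathcal{S}} \bigl(p_t(u) - \widetilde{p}_t(u)\bigr)\bigl(u^\top \widehat{\ell}_t\bigr) \le \eta \sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr)^2.
\]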


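Proof Fix an arbitrary $t$ and $u \in \mathcal{S}$, and define the “sparse loss vector” $\widehat{\ell}_t^-(u)$ with components 
$\widehat{\ell}_{t,k}^-(u) = u_k \widehat{\ell}_{t,k}$ and 

\[
\widetilde{V}_t^-(u) = \operatorname*{arg\,min}_{v \in \mathcal{S}} \Bigl\{ v^\top \bigl(\eta \widehat{L}_{t-1} + \eta \widehat{\ell}_t^-(u) - \widetilde{Z}\bigr) \Bigr\}.
\]

Using the notation $\widetilde{p}_t^-(u) = \mathbb{P}\bigl[\widetilde{V}_t^-(u) = u \mid \mathcal{F}_t\bigr]$, we show in Lemma 15 (stated and proved in 
Appendix A) that $\widetilde{p}_t^-(u) \le \widetilde{p}_t(u)$. Also, define 

\[
U(z) = \operatorname*{arg\,min}_{v \in \mathcal{S}} \Bigl\{ v^\top \bigl(\eta \widehat{L}_{t-1} - z\bigr) \Bigr\}.
\]

Letting $f(z) = e^{-\|z\|_1}$ ($z \in \mathbb{R}_+^d$) be the density of the perturbations, we have 

\[
\begin{aligned}
p_t(u) &= \int_{z \in [0,\infty]^d} \mathbb{1}_{\{U(z) = u\}} f(z)\,dz \\
&= e^{\eta \|\widehat{\ell}_t^-(u)\|_1} \int_{z \in [0,\infty]^d} \mathbb{1}_{\{U(z) = u\}} f\bigl(z + \eta \widehat{\ell}_t^-(u)\bigr)\,dz \\
&= e^{\eta \|\widehat{\ell}_t^-(u)\|_1} \int \cdots \int_{z_i \in [\eta \widehat{\ell}_{t,i}^-(u), \infty]} \mathbb{1}_{\{U(z - \eta \widehat{\ell}_t^-(u)) = u\}} f(z)\,dz \\
&\le e^{\eta \|\widehat{\ell}_t^-(u)\|_1} \int_{z \in [0,\infty]^d} \mathbb{1}_{\{U(z - \eta \widehat{\ell}_t^-(u)) = u\}} f(z)\,dz \\
&= e^{\eta \|\widehat{\ell}_t^-(u)\|_1}\, \widetilde{p}_t^-(u) \le e^{\eta \|\widehat{\ell}_t^-(u)\|_1}\, \widetilde{p}_t(u).
\end{aligned}
\]

Now notice that $\|\widehat{\ell}_t^-(u)\|_1 = u^\top \widehat{\ell}_t^-(u) = u^\top \widehat{\ell}_t$, which gives 

\[
\widetilde{p}_t(u) \ge p_t(u)\,e^{-\eta u^\top \widehat{\ell}_t} \ge p_t(u)\bigl(1 - \eta u^\top \widehat{\ell}_t\bigr).
\]

The proof is concluded by repeating the same argument for all $u \in \mathcal{S}$, reordering and summing the 
terms as 

\[
\sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr) \le \sum_{u \in \mathcal{S}} \widetilde{p}_t(u)\bigl(u^\top \widehat{\ell}_t\bigr) + \eta \sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr)^2. \tag{14}
\]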


4.3 Proof of Theorem 1 

Now, everything is ready to prove the bound on the expected regret of FPL+GR. Let us fix an arbitrary 
$v \in \mathcal{S}$. By putting together Lemmas 6, 7 and 8, we immediately obtain 

\[
\mathbb{E}\Bigl[\sum_{t=1}^{T} \sum_{u \in \mathcal{S}} p_t(u)\bigl((u - v)^\top \widehat{\ell}_t\bigr)\Bigr] \le \frac{m(\log(d/m) + 1)}{\eta} + 2\eta m d T, \tag{15}
\]

leaving us with the problem of upper bounding the expected regret in terms of the left-hand side of 
the above inequality. This can be done by using the properties of the loss estimates (5) stated in 
Lemma 5: 

\[
\mathbb{E}\Bigl[\sum_{t=1}^{T} (V_t - v)^\top \ell_t\Bigr] \le \mathbb{E}\Bigl[\sum_{t=1}^{T} \sum_{u \in \mathcal{S}} p_t(u)\bigl((u - v)^\top \widehat{\ell}_t\bigr)\Bigr] + \frac{dT}{eM}.
\]

Putting the two inequalities together proves the theorem. 
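Combining Eq. (15) with the bias term $dT/(eM)$ gives a bound of the form $m(\log(d/m)+1)/\eta + 2\eta m d T + dT/(eM)$, and its dependence on $\eta$ and $M$ can be explored numerically. The sketch below is illustrative only: the function names are hypothetical, and the learning rate returned by `balancing_eta` is simply the value that equalizes the first two terms by elementary calculus, which is an assumption of this sketch and not necessarily the tuning stated in Theorem 1.

```python
import math


def expected_regret_bound(m: int, d: int, T: int, eta: float, M: int) -> float:
    """Sum of the three terms obtained by combining Eq. (15) with the bias dT/(eM)."""
    c = math.log(d / m) + 1
    return m * c / eta + 2 * eta * m * d * T + d * T / (math.e * M)


def balancing_eta(m: int, d: int, T: int) -> float:
    """Learning rate equalizing m*c/eta and 2*eta*m*d*T (hypothetical tuning)."""
    c = math.log(d / m) + 1
    return math.sqrt(c / (2 * d * T))


if __name__ == "__main__":
    m, d, T = 5, 50, 10_000
    eta = balancing_eta(m, d, T)
    for M in (10, 100, 1000):
        print(M, expected_regret_bound(m, d, T, eta, M))
```

With this choice of $\eta$ the first two terms sum to $2m\sqrt{2dT(\log(d/m)+1)}$, while the last term shrinks linearly in $1/M$, which makes the trade-off between resampling cost and bias explicit.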

4.4 Proof of Theorem 2 

We now turn to prove a bound on the regret of FPL+GR.P that holds with high probability. We 
begin by noting that the conditions of Lemmas 7 and 8 continue to hold for the new loss estimates, 
so we can obtain the central terms in the regret: 

\[
\sum_{t=1}^{T} \sum_{u \in \mathcal{S}} p_t(u)\bigl((u - v)^\top \widetilde{\ell}_t\bigr) \le \frac{m(\log(d/m) + 1)}{\eta} + \eta \sum_{t=1}^{T} \sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widetilde{\ell}_t\bigr)^2.
\]

The first challenge posed by the above expression is relating the left-hand side to the true regret 
with high probability. Once this is done, the remaining challenge is to bound the second term on 
the right-hand side, as well as the other terms arising from the first step. We first show that the 
loss estimates used by FPL+GR.P consistently underestimate the true losses with high probability. 

Lemma 9 Fix any $\delta' > 0$. For any $v \in \mathcal{S}$, 

\[
v^\top \bigl(\widetilde{L}_T - L_T\bigr) \le \frac{m \log(d/\delta')}{\beta}
\]

holds with probability at least $1 - \delta'$. 

The simple proof is directly inspired by Appendix C.9 of Audibert and Bubeck (2010). 

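Proof Fix any $t$ and $i$. Then, 

\[
\mathbb{E}\bigl[\exp\bigl(\beta \widetilde{\ell}_{t,i}\bigr) \mid \mathcal{F}_{t-1}\bigr] = \mathbb{E}\bigl[\exp\bigl(\log\bigl(1 + \beta \widehat{\ell}_{t,i}\bigr)\bigr) \mid \mathcal{F}_{t-1}\bigr] \le 1 + \beta \ell_{t,i} \le \exp(\beta \ell_{t,i}),
\]

where we used Lemma 4 in the first inequality and $1 + z \le e^z$ that holds for all $z \in \mathbb{R}$. As a result, the 
process $W_t = \exp\bigl(\beta(\widetilde{L}_{t,i} - L_{t,i})\bigr)$ is a supermartingale with respect to $(\mathcal{F}_t)$: $\mathbb{E}[W_t \mid \mathcal{F}_{t-1}] \le W_{t-1}$. 
Observe that, since $W_0 = 1$, this implies $\mathbb{E}[W_t] \le \mathbb{E}[W_{t-1}] \le \dots \le 1$. Applying Markov's inequality 
gives that 

\[
\begin{aligned}
\mathbb{P}\bigl[\widetilde{L}_{T,i} > L_{T,i} + \varepsilon\bigr] &= \mathbb{P}\bigl[\widetilde{L}_{T,i} - L_{T,i} > \varepsilon\bigr] \\
&\le \mathbb{E}\bigl[\exp\bigl(\beta(\widetilde{L}_{T,i} - L_{T,i})\bigr)\bigr] \exp(-\beta\varepsilon) \le \exp(-\beta\varepsilon)
\end{aligned}
\]

holds for any $\varepsilon > 0$. The statement of the lemma follows after using $\|v\|_1 \le m$, applying the union 
bound for all $i$, and solving for $\varepsilon$. ■ 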

The following lemma states another key property of the loss estimates. 

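Lemma 10 For any $t$, 

\[
\sum_{i=1}^{d} q_{t,i} \widehat{\ell}_{t,i} \le \sum_{i=1}^{d} q_{t,i} \widetilde{\ell}_{t,i} + \frac{\beta}{2} \sum_{i=1}^{d} q_{t,i} \widehat{\ell}_{t,i}^2.
\]

Proof The statement follows trivially from the inequality $\log(1 + z) \ge z - \frac{z^2}{2}$ that holds for all 
$z \ge 0$. In particular, for any fixed $t$ and $i$, we have 

\[
\log\bigl(1 + \beta \widehat{\ell}_{t,i}\bigr) \ge \beta \widehat{\ell}_{t,i} - \frac{\beta^2}{2} \widehat{\ell}_{t,i}^2.
\]

Multiplying both sides by $q_{t,i}/\beta$ and summing for all $i$ proves the statement. ■ 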
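Since the inequality of Lemma 10 holds pointwise, it can be checked directly on arbitrary inputs. The sketch below assumes, as the proof of Lemma 9 suggests, that the estimates used by FPL+GR.P are $\widetilde{\ell}_{t,i} = \frac{1}{\beta}\log(1 + \beta\widehat{\ell}_{t,i})$; the function names and the example values are hypothetical.

```python
import math


def log_transformed_estimate(l_hat: float, beta: float) -> float:
    """Assumed FPL+GR.P estimate: log(1 + beta * l_hat) / beta."""
    return math.log1p(beta * l_hat) / beta


def lemma10_slack(q, l_hat, beta: float) -> float:
    """Right-hand side minus left-hand side of Lemma 10 (should be nonnegative)."""
    lhs = sum(qi * li for qi, li in zip(q, l_hat))
    first_order = sum(qi * log_transformed_estimate(li, beta) for qi, li in zip(q, l_hat))
    second_order = 0.5 * beta * sum(qi * li ** 2 for qi, li in zip(q, l_hat))
    return first_order + second_order - lhs


if __name__ == "__main__":
    q = [0.1, 0.5, 0.9]          # hypothetical marginal probabilities
    l_hat = [0.0, 3.0, 12.5]     # hypothetical GR loss estimates
    for beta in (0.01, 0.5, 2.0):
        print(beta, lemma10_slack(q, l_hat, beta))
```

The slack grows with $\beta$ and with the magnitude of the estimates, reflecting that the logarithmic transformation damps large values of $\widehat{\ell}_{t,i}$ the most.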

The next lemma relates the total loss of the learner to its total estimated losses. 

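Lemma 11 Fix any $\delta' > 0$. With probability at least $1 - 2\delta'$, 

\[
\sum_{t=1}^{T} V_t^\top \ell_t \le \sum_{t=1}^{T} \sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr) + \frac{dT}{eM} + \sqrt{2(e-2)T}\Bigl(m \log\frac{1}{\delta'} + 1\Bigr) + \sqrt{8T \log\frac{1}{\delta'}}.
\]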

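Proof We start by rewriting 

\[
\sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr) = \sum_{i=1}^{d} q_{t,i} K_{t,i} V_{t,i} \ell_{t,i}.
\]

Now let $k_{t,i} = \mathbb{E}\bigl[K_{t,i} \mid \mathcal{F}_{t-1}\bigr]$ for all $i$ and notice that 

\[
X_t = \sum_{i=1}^{d} q_{t,i} V_{t,i} \ell_{t,i} \bigl(k_{t,i} - K_{t,i}\bigr)
\]

is a martingale-difference sequence with respect to $(\mathcal{F}_t)$ with elements upper-bounded by $m$ (as 
Lemma 4 implies $k_{t,i} q_{t,i} \le 1$ and $\|V_t\|_1 \le m$). Furthermore, the conditional variance of the 
increments is bounded as 

\[
\operatorname{Var}\bigl[X_t \mid \mathcal{F}_{t-1}\bigr]
\le \mathbb{E}\Bigl[\Bigl(\sum_{i=1}^{d} q_{t,i} V_{t,i} K_{t,i}\Bigr)^2 \Bigm| \mathcal{F}_{t-1}\Bigr]
\le \mathbb{E}\Bigl[\sum_{j=1}^{d} V_{t,j} \sum_{i=1}^{d} \bigl(q_{t,i}^2 K_{t,i}^2\bigr) \Bigm| \mathcal{F}_{t-1}\Bigr]
\le 2m,
\]

where the second inequality is Cauchy-Schwarz and the third one follows from $\mathbb{E}\bigl[K_{t,i}^2 \mid \mathcal{F}_{t-1}\bigr] \le 2/q_{t,i}^2$ 
and $\|V_t\|_1 \le m$. Thus, applying Lemma 16 with $B = m$ and $\Sigma_T^2 \le 2mT$ we get that for any 
$S \ge m\sqrt{\log\frac{1}{\delta'}/(e-2)}$, 

\[
\sum_{t=1}^{T} \sum_{i=1}^{d} q_{t,i} \ell_{t,i} V_{t,i} \bigl(k_{t,i} - K_{t,i}\bigr) \le \sqrt{(e-2)\log\frac{1}{\delta'}}\,\Bigl(\frac{2mT}{S} + S\Bigr)
\]

holds with probability at least $1 - \delta'$, where we have used $\|V_t\|_1 \le m$. After setting $S = m\sqrt{2T \log\frac{1}{\delta'}}$, 
we obtain that 

\[
\sum_{t=1}^{T} \sum_{i=1}^{d} q_{t,i} \ell_{t,i} V_{t,i} \bigl(k_{t,i} - K_{t,i}\bigr) \le \sqrt{2(e-2)T}\Bigl(m \log\frac{1}{\delta'} + 1\Bigr) \tag{16}
\]

holds with probability at least $1 - \delta'$. 

To proceed, observe that $q_{t,i} k_{t,i} = 1 - (1 - q_{t,i})^M$ holds by Lemma 4, implying 

\[
\sum_{i=1}^{d} q_{t,i} V_{t,i} \ell_{t,i} k_{t,i} \ge V_t^\top \ell_t - \sum_{i=1}^{d} V_{t,i} (1 - q_{t,i})^M.
\]

Together with Eq. (16), this gives 

\[
\sum_{t=1}^{T} V_t^\top \ell_t \le \sum_{t=1}^{T} \sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \widehat{\ell}_t\bigr) + \sqrt{2(e-2)T}\Bigl(m \log\frac{1}{\delta'} + 1\Bigr) + \sum_{t=1}^{T} \sum_{i=1}^{d} V_{t,i} (1 - q_{t,i})^M.
\]

Finally, we use that, by Lemma 5, $q_{t,i}(1 - q_{t,i})^M \le 1/(eM)$, and 

\[
Y_t = \sum_{i=1}^{d} \bigl(V_{t,i} - q_{t,i}\bigr)(1 - q_{t,i})^M
\]

is a martingale-difference sequence with respect to $(\mathcal{F}_t)$ with increments bounded in $[-1, 1]$. Then, 
by an application of the Hoeffding-Azuma inequality, we have 

\[
\sum_{t=1}^{T} \sum_{i=1}^{d} V_{t,i} (1 - q_{t,i})^M \le \frac{dT}{eM} + \sqrt{8T \log\frac{1}{\delta'}}
\]

with probability at least $1 - \delta'$, thus proving the lemma. ■ 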
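The step leading to Eq. (16) amounts to plugging a particular $S$ into the second bound of Lemma 16. The sketch below checks this algebra numerically under the assumptions that the variance proxy is $\Sigma_T^2 = 2mT$ and that $S = m\sqrt{2T\log(1/\delta')}$; both the function names and the example parameters are hypothetical.

```python
import math


def lemma16_second_bound(sigma_sq: float, S: float, delta: float) -> float:
    """sqrt((e - 2) * log(1/delta)) * (sigma_sq / S + S), as in Lemma 16."""
    return math.sqrt((math.e - 2) * math.log(1 / delta)) * (sigma_sq / S + S)


def eq16_rhs(m: int, T: int, delta: float) -> float:
    """Right-hand side of Eq. (16): sqrt(2(e - 2)T) * (m * log(1/delta) + 1)."""
    return math.sqrt(2 * (math.e - 2) * T) * (m * math.log(1 / delta) + 1)


if __name__ == "__main__":
    m, T, delta = 3, 1000, 0.05
    S = m * math.sqrt(2 * T * math.log(1 / delta))  # assumed choice of S
    print(lemma16_second_bound(2 * m * T, S, delta))
    print(eq16_rhs(m, T, delta))
```

The two printed values should coincide up to floating-point rounding, which confirms that this choice of $S$ reproduces the stated deviation term.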

Finally, our last lemma in this section bounds the second-order terms arising from Lemmas 8 and 10. 

Lemma 12 Fix any $\delta' > 0$. With probability at least $1 - 2\delta'$, the following hold simultaneously: 

\[
\sum_{t=1}^{T} \sum_{v \in \mathcal{S}} p_t(v)\bigl(v^\top \widehat{\ell}_t\bigr)^2 \le Mm\sqrt{2T \log\frac{1}{\delta'}} + 2md\sqrt{T \log\frac{1}{\delta'}} + 2mdT,
\]

\[
\sum_{t=1}^{T} \sum_{i=1}^{d} q_{t,i} \widehat{\ell}_{t,i}^2 \le M\sqrt{2mT \log\frac{1}{\delta'}} + 2d\sqrt{T \log\frac{1}{\delta'}} + 2dT.
\]


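Proof First, recall that 

\[
\mathbb{E}\Bigl[\sum_{v \in \mathcal{S}} p_t(v)\bigl(v^\top \widehat{\ell}_t\bigr)^2 \Bigm| \mathcal{F}_{t-1}\Bigr] \le 2md
\]

holds by Lemma 6. Now, observe that 

\[
X_t = \sum_{v \in \mathcal{S}} p_t(v)\Bigl(\bigl(v^\top \widehat{\ell}_t\bigr)^2 - \mathbb{E}\Bigl[\bigl(v^\top \widehat{\ell}_t\bigr)^2 \Bigm| \mathcal{F}_{t-1}\Bigr]\Bigr)
\]

is a martingale-difference sequence with increments in $[-2md, mM]$. An application of the 
Hoeffding-Azuma inequality gives that 

\[
\sum_{t=1}^{T} \sum_{v \in \mathcal{S}} p_t(v)\Bigl(\bigl(v^\top \widehat{\ell}_t\bigr)^2 - \mathbb{E}\Bigl[\bigl(v^\top \widehat{\ell}_t\bigr)^2 \Bigm| \mathcal{F}_{t-1}\Bigr]\Bigr) \le Mm\sqrt{2T \log\frac{1}{\delta'}} + 2md\sqrt{T \log\frac{1}{\delta'}}
\]

holds with probability at least $1 - \delta'$. Reordering the terms completes the proof of the first statement. 
The second statement is proven analogously, building on the bound 

\[
\mathbb{E}\Bigl[\sum_{i=1}^{d} q_{t,i} \widehat{\ell}_{t,i}^2 \Bigm| \mathcal{F}_{t-1}\Bigr] \le \mathbb{E}\Bigl[\sum_{i=1}^{d} q_{t,i} V_{t,i} K_{t,i}^2 \Bigm| \mathcal{F}_{t-1}\Bigr] \le 2d.
\]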


Theorem 2 follows from combining Lemmas 9 through 12 and applying the union bound. 


5. Improved Bounds for Learning With Full Information 

Our proof techniques presented in Section 4.2 also enable us to tighten the guarantees for FPL in the 
full information setting. In particular, consider the algorithm choosing action 

\[
V_t = \operatorname*{arg\,min}_{v \in \mathcal{S}} v^\top \bigl(\eta L_{t-1} - Z_t\bigr),
\]

where $L_t = \sum_{s=1}^{t} \ell_s$ and the components of $Z_t$ are drawn independently from a standard exponential 
distribution. We state our improved regret bounds concerning this algorithm in the following 
theorem. 


Theorem 13 For any $v \in \mathcal{S}$, the total expected regret of FPL satisfies 

\[
R_T \le \frac{m(\log(d/m) + 1)}{\eta} + \eta m \sum_{t=1}^{T} \mathbb{E}\bigl[V_t^\top \ell_t\bigr]
\]

under full information. In particular, defining $L_T^* = \min_{v \in \mathcal{S}} v^\top L_T$ and setting 

\[
\eta = \min\Biggl\{ \sqrt{\frac{\log(d/m) + 1}{L_T^*}},\ \frac{1}{2} \Biggr\},
\]

the regret of FPL satisfies 


Rt < 4m max 





In the worst case, the above bound becomes $2m^{3/2}\sqrt{T(\log(d/m) + 1)}$, which improves the best 
known bound for FPL of Kalai and Vempala (2005) by a factor of $\sqrt{d/m}$. 

Proof The first statement follows from combining Lemmas 7 and 8, and bounding 

\[
\sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \ell_t\bigr)^2 \le m \sum_{u \in \mathcal{S}} p_t(u)\bigl(u^\top \ell_t\bigr),
\]


while the second one follows from standard algebraic manipulations. 
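The full-information FPL rule analyzed in this section can be sketched in a few lines for one concrete decision set. The code below is a minimal illustration under the assumption that $\mathcal{S}$ consists of all binary vectors with exactly $m$ ones, so that the linear optimization step reduces to picking the $m$ smallest perturbed scores; for a general $\mathcal{S}$ this step would be replaced by a problem-specific optimization oracle. The function names are hypothetical.

```python
import random


def fpl_choose_m_set(cum_loss, eta: float, m: int, rng: random.Random):
    """One FPL step: minimize v^T (eta * L - Z) over binary vectors with m ones."""
    d = len(cum_loss)
    # Independent standard exponential perturbations, one per component.
    z = [rng.expovariate(1.0) for _ in range(d)]
    scores = [eta * cum_loss[i] - z[i] for i in range(d)]
    # For m-sets, the minimizer selects the m components with the smallest scores.
    chosen = sorted(range(d), key=scores.__getitem__)[:m]
    v = [0] * d
    for i in chosen:
        v[i] = 1
    return v


def run_fpl(losses, eta: float, m: int, seed: int = 0) -> float:
    """Play FPL on a sequence of fully observed loss vectors; return the total loss."""
    rng = random.Random(seed)
    cum = [0.0] * len(losses[0])
    total = 0.0
    for loss in losses:
        v = fpl_choose_m_set(cum, eta, m, rng)
        total += sum(vi * li for vi, li in zip(v, loss))
        cum = [c + l for c, l in zip(cum, loss)]  # full information: all losses seen
    return total


if __name__ == "__main__":
    losses = [[0.0, 1.0, 1.0, 0.5]] * 200
    print(run_fpl(losses, eta=0.1, m=2))
```

Fresh perturbations are drawn in every round, matching the description of $Z_t$ above; the learning rate passed as `eta` is left to the caller.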


6. Conclusions and Open Problems 

In this paper, we have described the first general and efficient algorithm for online combinatorial 
optimization under semi-bandit feedback. We have proved that the regret of this algorithm is 
$O(m\sqrt{dT\log(d/m)})$ in this setting, and have also shown that FPL can achieve $O(m^{3/2}\sqrt{T\log(d/m)})$ 
in the full information case when tuned properly. While these bounds are off by a factor of 
$\sqrt{m\log(d/m)}$ and $\sqrt{m}$ from the respective minimax results, they exactly match the best known 
regret bounds for the well-studied Exponentially Weighted Forecaster (EWA). Whether the remaining 
gaps can be closed for FPL-style algorithms (e.g., by using more intricate perturbation schemes or 
a more refined analysis) remains an important open question. Nevertheless, we regard our contribution 
as a significant step towards understanding the inherent trade-offs between computational 
efficiency and performance guarantees in online combinatorial optimization and, more generally, in 
online optimization. 

The efficiency of our method rests on a novel loss estimation method called Geometric Resampling 
(GR). This estimation method is not specific to the proposed learning algorithm. While GR has 
no immediate benefits for OSMD-type algorithms where the ideal importance weights are readily 
available, it is possible to think about problem instances where EWA can be efficiently implemented 
while importance weights are difficult to compute. 

The most important open problem left is the case of efficient online linear optimization with 
full bandit feedback where the learner only observes the inner product $V_t^\top \ell_t$ in round $t$. Learning 
algorithms for this problem usually require that the (pseudo-)inverse of the covariance matrix $P_t = 
\mathbb{E}\bigl[V_t V_t^\top \mid \mathcal{F}_{t-1}\bigr]$ is readily available for the learner at each time step (see, e.g., McMahan and Blum 
(2004); Dani et al. (2008); Cesa-Bianchi and Lugosi (2012); Bubeck et al. (2012)). Computing 
this matrix, however, is at least as challenging as computing the individual importance weights 
$1/q_{t,i}$. That said, our Geometric Resampling technique can be directly generalized to this setting by 
observing that the matrix geometric series $\sum_{n=0}^{\infty} (I - P_t)^n$ converges to $P_t^{-1}$ under certain conditions. 
This sum can then be efficiently estimated by sampling independent copies of $V_t$, which paves the 
path for constructing low-bias estimates of the loss vectors. While it seems straightforward to go 
ahead and use these estimates in tandem with FPL, we have to note that the analysis presented in 
this paper does not carry through directly in this case. The main limitation is that our techniques 
only apply for loss vectors with non-negative elements (cf. Lemma 8). Nevertheless, we believe 
that Geometric Resampling should be a crucial component for constructing truly effective learning 
algorithms for this important problem. 
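The matrix geometric series mentioned above can be illustrated on a small example. The sketch below uses plain Python lists to stay self-contained; the helper names and the $2 \times 2$ matrix `P` are hypothetical, and convergence is assumed to require that all eigenvalues of $I - P$ lie strictly inside the unit circle, which holds for this particular `P`.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_add(A, B):
    """Entrywise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def geometric_series_partial_sum(P, n_terms):
    """Sum of (I - P)^n for n = 0, ..., n_terms - 1."""
    n = len(P)
    I = identity(n)
    R = [[I[i][j] - P[i][j] for j in range(n)] for i in range(n)]
    total = identity(n)   # the n = 0 term
    power = identity(n)
    for _ in range(1, n_terms):
        power = mat_mul(power, R)
        total = mat_add(total, power)
    return total


if __name__ == "__main__":
    P = [[0.5, 0.2], [0.2, 0.4]]   # hypothetical positive definite example
    approx_inverse = geometric_series_partial_sum(P, 300)
    print(mat_mul(P, approx_inverse))   # should be close to the identity matrix
```

In the bandit setting the powers of $I - P_t$ would not be computed exactly but estimated from independent samples, in the same spirit as the scalar Geometric Resampling procedure; the sketch only illustrates the deterministic identity behind that idea.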
Appendix A. Further Proofs and Technical Tools 

Lemma 14 Let $Z_1, \dots, Z_d$ be i.i.d. exponentially distributed random variables with unit expectation 
and let $Z_1^*, \dots, Z_d^*$ be their permutation such that $Z_1^* \ge Z_2^* \ge \dots \ge Z_d^*$. Then, for any $1 \le m \le d$, 

\[
\mathbb{E}\Bigl[\sum_{i=1}^{m} Z_i^*\Bigr] \le m\Bigl(\log\frac{d}{m} + 1\Bigr).
\]


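Proof Let us define $Y = \sum_{i=1}^{m} Z_i^*$. Then, as $Y$ is nonnegative, we have for any $A \ge 0$ that 

\[
\begin{aligned}
\mathbb{E}[Y] &= \int_0^\infty \mathbb{P}[Y > y]\,dy \\
&\le A + \int_A^\infty \mathbb{P}\Bigl[\sum_{i=1}^{m} Z_i^* > y\Bigr]\,dy \\
&\le A + \int_A^\infty \mathbb{P}\Bigl[Z_1^* > \frac{y}{m}\Bigr]\,dy \\
&\le A + d\int_A^\infty \mathbb{P}\Bigl[Z_1 > \frac{y}{m}\Bigr]\,dy \\
&= A + de^{-A/m} \\
&\le m\log\Bigl(\frac{d}{m}\Bigr) + m,
\end{aligned}
\]

where in the last step, we used that $A = m\log\bigl(\frac{d}{m}\bigr)$ minimizes $A + de^{-A/m}$ over the real line. ■ 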
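The inequality stated in Lemma 14 can also be checked against the exact expectation. The sketch below relies on a standard fact about exponential order statistics that is not taken from the paper: the $j$-th largest of $d$ i.i.d. unit-mean exponentials has expectation $\sum_{k=j}^{d} 1/k$. The function names are hypothetical.

```python
import math


def expected_top_m_sum(d: int, m: int) -> float:
    """Exact E[sum of the m largest of d i.i.d. Exp(1) variables]."""
    # The j-th largest order statistic has mean 1/j + 1/(j+1) + ... + 1/d.
    return sum(sum(1.0 / k for k in range(j, d + 1)) for j in range(1, m + 1))


def lemma14_bound(d: int, m: int) -> float:
    """The bound m * (log(d / m) + 1) from Lemma 14."""
    return m * (math.log(d / m) + 1)


if __name__ == "__main__":
    for d, m in ((10, 1), (10, 5), (10, 10), (1000, 20)):
        print(d, m, expected_top_m_sum(d, m), lemma14_bound(d, m))
```

For $m = d$ both quantities equal $d$, so the bound is tight in that case, while for $m = 1$ it reduces to the familiar $\log d + 1$ bound on the expected maximum.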

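Lemma 15 Fix any $v \in \mathcal{S}$ and any vectors $L \in \mathbb{R}^d$ and $\ell \in [0, \infty)^d$. Define the vector $\ell'$ with 
components $\ell'_k = v_k \ell_k$. Then, for any perturbation vector $Z$ with independent components, 

\[
\begin{aligned}
&\mathbb{P}\bigl[v^\top (L + \ell' - Z) \le u^\top (L + \ell' - Z)\ (\forall u \in \mathcal{S})\bigr] \\
&\qquad \le \mathbb{P}\bigl[v^\top (L + \ell - Z) \le u^\top (L + \ell - Z)\ (\forall u \in \mathcal{S})\bigr].
\end{aligned}
\]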


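Proof Fix any $u \in \mathcal{S} \setminus \{v\}$ and define the vector $\ell'' = \ell - \ell'$. Define the events 

\[
A'(u) = \bigl\{v^\top (L + \ell' - Z) \le u^\top (L + \ell' - Z)\bigr\}
\]

and 

\[
A(u) = \bigl\{v^\top (L + \ell - Z) \le u^\top (L + \ell - Z)\bigr\}.
\]

We have 

\[
\begin{aligned}
A'(u) &= \bigl\{(v - u)^\top Z \ge (v - u)^\top (L + \ell')\bigr\} \\
&\subseteq \bigl\{(v - u)^\top Z \ge (v - u)^\top (L + \ell') - u^\top \ell''\bigr\} \\
&= \bigl\{(v - u)^\top Z \ge (v - u)^\top (L + \ell)\bigr\} = A(u),
\end{aligned}
\]

where we used $v^\top \ell'' = 0$ and $u^\top \ell'' \ge 0$. Now, since $A'(u) \subseteq A(u)$, we have $\bigcap_{u \in \mathcal{S}} A'(u) \subseteq \bigcap_{u \in \mathcal{S}} A(u)$, 
thus proving $\mathbb{P}\bigl[\bigcap_{u \in \mathcal{S}} A'(u)\bigr] \le \mathbb{P}\bigl[\bigcap_{u \in \mathcal{S}} A(u)\bigr]$ as claimed in the lemma. ■ 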


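Lemma 16 (cf. Theorem 1 in Beygelzimer et al. (2011)) Assume $X_1, X_2, \dots, X_T$ is a martingale-difference 
sequence with respect to the filtration $(\mathcal{F}_t)$ with $X_t \le B$ for $1 \le t \le T$. Let $\sigma_t^2 = 
\operatorname{Var}\bigl[X_t \mid \mathcal{F}_{t-1}\bigr]$ and $\Sigma_t^2 = \sum_{s=1}^{t} \sigma_s^2$. Then, for any $\delta > 0$, 

\[
\mathbb{P}\Biggl[\sum_{t=1}^{T} X_t > B\log\frac{1}{\delta} + (e-2)\frac{\Sigma_T^2}{B}\Biggr] \le \delta.
\]

Furthermore, for any $S > B\sqrt{\log(1/\delta)/(e-2)}$, 

\[
\mathbb{P}\Biggl[\sum_{t=1}^{T} X_t > \sqrt{(e-2)\log\frac{1}{\delta}}\,\Bigl(\frac{\Sigma_T^2}{S} + S\Bigr)\Biggr] \le \delta.
\]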